A Consistent and Efficient Estimator for Data-Oriented Parsing
نویسندگان
چکیده
Given a sequence of samples from an unknown probability distribution, a statistical estimator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One crucial property of a ‘good’ estimator is that its guess approaches the unknown distribution as the sample sequence grows large. This property is called consistency. This paper concerns estimators for natural language parsing under the DataOriented Parsing (DOP) model. The DOP model specifies how a probabilistic grammar is acquired from statistics over a given training treebank, a corpus of sentence-parse pairs. Recently, Johnson [15] showed that the DOP estimator (called DOP1) is biased and inconsistent. A second relevant problem with DOP1 is that it suffers from an overwhelming computational inefficiency. This paper presents the first (nontrivial) consistent estimator for the DOP model. The new estimator is based on a combination of held-out estimation and a bias toward parsing with shorter derivations. To justify the need for a biased estimator in the case of DOP, we prove that every non-overfitting DOP estimator is statistically biased. Our choice for the bias toward shorter derivations is justified by empirical experience, mathematical convenience and efficiency considerations. In support of our theoretical results of consistency and computational efficiency, we also report experimental results with the new estimator.
منابع مشابه
A Consistent and Efficient Estimator for the Data-oriented Parsing Model
Given a sequence of samples from an unknown probability distribution, a statistical estimator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One desired property of an estimator is that its guess approaches the unknown distribution as the sample sequence grows large. Mathematically speaking, this property is called consistency. This thesis p...
متن کاملA Consistent Estimator for Uniform Parameter Under Interval Censoring
The censored data are widely used in statistical tests and parameters estimation. In some cases e.g. medical accidents which data are not recorded at the time of occurrence, some methods such as interval censoring are used. In this paper, for a random sample uniformly distributed on the interval (0,θ) the interval censoring have been used. A consistent estimator of θ and some asy...
متن کاملData-Oriented Parsing
1. A DOP model for phrase-structure trees R. Bod and R. Scha 2. Probability models for DOP R. Bonnema 3. Encoding frequency information in stochastic parsing models 1. Computational complexity of disambiguation under DOP K. Sima'an 2. Parsing DOP with Monte Carlo techniques J. Chappelier and M. Rajman 3. Towards efficient Monte Carlo parsing R. Bonnema 4. Efficient parsing of DOP with PCFG-redu...
متن کاملEstimation of E(Y) from a Population with Known Quantiles
‎In this paper‎, ‎we consider the problem of estimating E(Y) based on a simple random sample when at least one of the population quantiles is known‎. ‎We propose a stratified estimator of E(Y)‎, ‎and show that it is strongly consistent‎. ‎We then establish the asymptotic normality of the suggested estimator‎, ‎and prove that it ...
متن کاملBack-off as Parameter Estimation for DOP models
Data-Oriented Parsing (DOP) is a probabilistic performance approach to parsing natural language. Several DOP models have been proposed since it was introduced by Scha (1990), achieving promising results. One important feature of these models is the probability estimation procedure. Two major estimators have been put forward: Bod (1993) uses a relative frequency estimator; Bonnema (1999) adds a ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Journal of Automata, Languages and Combinatorics
دوره 10 شماره
صفحات -
تاریخ انتشار 2005